A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies

Author

  • Jules J. Berman
Abstract

BACKGROUND Large biomedical data sets have become increasingly important resources for medical researchers. Modern biomedical data sets are annotated with standard terms to describe the data and to support data linking between databases. The largest curated listing of biomedical terms is the National Library of Medicine's Unified Medical Language System (UMLS). The UMLS contains more than 2 million biomedical terms collected from nearly 100 medical vocabularies. Many of the vocabularies contained in the UMLS carry restrictions on their use, making it impossible to share or distribute UMLS-annotated research data. However, a subset of the UMLS vocabularies, designated Category 0 by UMLS, can be used to annotate and share data sets without violating the UMLS License Agreement.

METHODS The UMLS Category 0 vocabularies can be extracted from the parent UMLS Metathesaurus using a Perl script supplied with this article. There are 43 Category 0 vocabularies that can be used freely for research purposes without violating the UMLS License Agreement. Among the Category 0 vocabularies are MESH (Medical Subject Headings), the NCBI (National Center for Biotechnology Information) Taxonomy and ICD-9-CM (International Classification of Diseases, 9th Revision, Clinical Modification).

RESULTS The extraction file containing all Category 0 terms and concepts is 72,581,138 bytes in length and contains 1,029,161 terms. The UMLS Metathesaurus MRCON file (January 2003) is 151,048,493 bytes in length and contains 2,146,899 terms. Therefore the Category 0 vocabularies, in aggregate, are about half the size of the UMLS Metathesaurus. A large publicly available listing of 567,921 different medical phrases was automatically coded using the full UMLS Metathesaurus and the Category 0 vocabularies. There were 545,321 phrases with one or more matches against UMLS terms, while 468,785 phrases had one or more matches against the Category 0 terms. When the two vocabularies are evaluated by their ability to find at least one term for a medical phrase, the Category 0 vocabularies therefore performed 86% as well as the complete UMLS Metathesaurus (468,785 / 545,321).

CONCLUSION The Category 0 vocabularies of UMLS constitute a large nomenclature that can be used by biomedical researchers to annotate biomedical data. These annotated data sets can be distributed for research purposes without violating the UMLS License Agreement. These vocabularies may be of particular importance for sharing heterogeneous data from diverse biomedical data sets. The software tools to extract the Category 0 vocabularies are freely available Perl scripts entered into the public domain and distributed with this article.
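The extraction step and the reported ratios can be illustrated with a short sketch. This is not the article's Perl script, which is not reproduced on this page; it is written in Python for illustration only. It assumes the input is pipe-delimited text in which one field holds a restriction level and the value "0" marks Category 0. The function names, the three-column layout, and the sample rows are hypothetical. Only the counts in the second half come from the abstract.

```python
from typing import Iterable, Iterator, List


def category0_rows(
    lines: Iterable[str], level_index: int, delimiter: str = "|"
) -> Iterator[List[str]]:
    """Yield the fields of each row whose restriction-level field is "0".

    level_index is an assumption supplied by the caller; check the
    documentation of the actual file for the real column position.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) > level_index and fields[level_index] == "0":
            yield fields


def ratio(part: int, whole: int) -> float:
    """Return part / whole as a plain fraction."""
    return part / whole


# Hypothetical sample rows: concept id | term | restriction level
sample = [
    "C0000001|example term a|0",
    "C0000002|example term b|3",
    "C0000003|example term c|0",
]
kept = list(category0_rows(sample, level_index=2))

# Figures reported in the abstract.
coverage = ratio(468_785, 545_321)           # phrases matched: Category 0 vs full UMLS
term_share = ratio(1_029_161, 2_146_899)     # terms: Category 0 vs MRCON
byte_share = ratio(72_581_138, 151_048_493)  # bytes: extraction file vs MRCON

print(f"kept {len(kept)} of {len(sample)} sample rows")
print(f"coverage {coverage:.2f}, term share {term_share:.2f}, byte share {byte_share:.2f}")
```

Rounded to two decimals, the coverage ratio is 0.86 and both size ratios are 0.48, which is consistent with the "86% as well" and "about half the size" statements in the abstract.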

Related resources

The Unified Medical Language System (UMLS): integrating biomedical terminology

The Unified Medical Language System (http://umlsks.nlm.nih.gov) is a repository of biomedical vocabularies developed by the US National Library of Medicine. The UMLS integrates over 2 million names for some 900,000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts. Vocabularies integrated in the UMLS Metathesaurus include the NC...

Mapping scheme of Traditional Iranian Medicine concepts in the structure of the Metathesaurus and Semantic Network of the Unified Medical Language System (UMLS)

Introduction: This research aimed to analyze the mapping scheme of Traditional Iranian Medicine (TIM) onto the structure of the Metathesaurus and Semantic Network of the Unified Medical Language System (UMLS). The domain, location and relations of TIM in the UMLS are designed, and the location and proportion of TIM concepts are described. Methods: This is a triphasic research...

Evaluating lexical variant generation to improve information retrieval

Techniques for managing lexical variation constitute an integral part of information retrieval systems. We report on a series of experiments aimed at evaluating LVG, a lexical variant management tool which addresses the particular problems involved in matching health related vocabularies to concepts in the Unified Medical Language System (UMLS) Metathesaurus. Experiments conducted on data from ...

Automatic Classification and Visualization of UMLS Source Vocabularies through Semantic Group Profiles

The Unified Medical Language System® (UMLS) is a comprehensive terminology integration system designed to support the development of electronic information systems. The UMLS integrates 161 source vocabularies, though for a given purpose, a developer may not need every vocabulary. With the breadth of vocabularies available, there is a need for classifying the UMLS source vocabularies with respec...

Journal title:
  • BMC Medical Informatics and Decision Making

Volume 3  Issue 

Pages  -

Publication date 2003